Tidyverse: enhances data manipulation and visualization with a tidy data workflow, fostering code that is readable, maintainable, and reproducible.
Core packages: ggplot2, dplyr, tidyr, readr, broom
Our Dataset
Source: Behavioral Risk Factor Surveillance System (BRFSS) 2015.
Key Features: Health indicators related to diabetes, including:
:::
What are the key predictive variables in diabetes prognosis?
How does gender influence the manifestation and progression of diabetes?
Removed Missing Values: df_cleaned <- df |> drop_na()
Verified Data Types: column_types <- summarise(df_cleaned, across(everything(), class))
Filtered Incorrect Values: Filtered out rows with values outside expected ranges.
Transformed Variables: Binary to categorical (e.g., Smoker to Smoking Status).
Created New Variables: E.g., Habits, Health Risk, based on lifestyle and health indicators.
Socio-Economic Class: Derived from income, education, and healthcare status.
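The cleaning and transformation steps above can be sketched as follows. This is a minimal illustration on a hypothetical toy data frame, not the actual BRFSS pipeline: the BMI range, the factor labels, and the name `Smoking_Status` are assumptions made for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the BRFSS extract
df <- tibble(
  Smoker = c(0, 1, NA, 1, 0),
  BMI    = c(24, 31, 28, 250, 22),  # 250 is an implausible value
  Income = c(3, 8, 5, 6, 1)
)

df_cleaned <- df |>
  drop_na() |>                       # remove rows with missing values
  filter(between(BMI, 10, 100)) |>   # assumed plausible range, not from the source
  mutate(
    # binary indicator recoded as a labelled categorical variable
    Smoking_Status = factor(Smoker, levels = c(0, 1),
                            labels = c("Non-smoker", "Smoker"))
  )

df_cleaned
```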
Between all variables: Health-related variables are correlated with one another. No pair of variables is strongly negatively correlated, although GenHlth and Income are negatively correlated.
With the target variable: GenHlth, HighBP and BMI are the variables most correlated with the diabetes variable.
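A correlation check of this kind could look like the sketch below. The data are simulated, so the values carry no meaning, and the target column name `Diabetes_binary` is an assumption (the slides only refer to "the diabetes variable").

```r
library(dplyr)

set.seed(1)
n <- 200

# Hypothetical numeric data; Diabetes_binary is an assumed column name
toy <- tibble(
  GenHlth         = sample(1:5, n, replace = TRUE),
  HighBP          = rbinom(n, 1, 0.4),
  BMI             = rnorm(n, 28, 5),
  Income          = sample(1:8, n, replace = TRUE),
  Diabetes_binary = rbinom(n, 1, 0.2)
)

# Pairwise Pearson correlations between all variables
cor_matrix <- cor(toy)

# Correlation of each predictor with the target, strongest first
predictors <- setdiff(colnames(cor_matrix), "Diabetes_binary")
cor_with_target <- cor_matrix[predictors, "Diabetes_binary"]
cor_with_target <- cor_with_target[order(abs(cor_with_target), decreasing = TRUE)]

round(cor_matrix, 2)
cor_with_target
```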
All variables: A GLM was fitted with all numerical variables.
Step: Forward and backward stepwise selection were used to choose the best variables.
Results: The backward model achieved the lowest AIC and retains 19 variables. The variables excluded from the full model are Smoker, AnyHealthcare, NoDocbcCost and Education_binary.
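The full model and the stepwise search can be sketched with base R's `glm()` and `step()`. The data below are simulated and `Diabetes_binary` is an assumed target name. The binomial family is also an assumption, chosen because the outcome is binary.

```r
set.seed(42)
n <- 500

# Hypothetical predictors; Smoker has no effect on the simulated outcome
toy <- data.frame(
  HighBP  = rbinom(n, 1, 0.4),
  BMI     = rnorm(n, 28, 5),
  GenHlth = sample(1:5, n, replace = TRUE),
  Smoker  = rbinom(n, 1, 0.45)
)
lp <- -6 + 0.8 * toy$HighBP + 0.1 * toy$BMI + 0.5 * toy$GenHlth
toy$Diabetes_binary <- rbinom(n, 1, plogis(lp))

# Full model with every predictor, and an intercept-only starting point
full_model <- glm(Diabetes_binary ~ ., data = toy, family = binomial)
null_model <- glm(Diabetes_binary ~ 1, data = toy, family = binomial)

# Backward elimination from the full model
backward_model <- step(full_model, direction = "backward", trace = 0)

# Forward selection from the null model, up to the full model's terms
forward_model <- step(
  null_model,
  direction = "forward",
  scope = list(lower = ~1, upper = formula(full_model)),
  trace = 0
)

# Compare the candidates by AIC (lower is better)
AIC(full_model, backward_model, forward_model)
```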
Selected components: 15 components, which together reach 80% of explained variability.
Logistic regression: These components are used as predictors in a diabetes prediction model.
Results: Accuracy of 87%.
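One way to reproduce this kind of workflow is sketched below, assuming the components are principal components obtained with `prcomp()` (the slides do not name the method). The data are simulated, the 0.5 classification threshold is an assumption, and the accuracy is computed in-sample for brevity.

```r
set.seed(7)
n <- 400

# Six hypothetical numeric predictors and a simulated binary outcome
X <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
y <- rbinom(n, 1, plogis(X$V1 - 0.5 * X$V2))

# PCA on centred and scaled predictors
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Smallest number of components reaching 80% of explained variance
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.80)[1]

# Logistic regression on the first k component scores
scores <- as.data.frame(pca$x[, 1:k, drop = FALSE])
scores$y <- y
pc_model <- glm(y ~ ., data = scores, family = binomial)

# In-sample accuracy at an assumed 0.5 threshold
pred <- as.integer(predict(pc_model, type = "response") > 0.5)
accuracy <- mean(pred == y)
accuracy
```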
Men VS. Women: Creation of two different datasets according to sex.
Results: The Men model performs better, as indicated by its lower AIC. It gives more importance to general health variables and to the fruit variable. Much better performance than the GLM from the first part of the analysis.
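Splitting the data by sex and fitting one model per group might look like this sketch. The data are simulated, the column names `Sex`, `Fruits` and `Diabetes_binary` and the 0/1 coding of `Sex` are assumptions, and `fit_by_sex()` is a hypothetical helper written for this example.

```r
library(dplyr)

set.seed(3)
n <- 600

# Hypothetical data; assumed coding: Sex 0 = female, 1 = male
toy <- tibble(
  Sex             = rbinom(n, 1, 0.5),
  GenHlth         = sample(1:5, n, replace = TRUE),
  Fruits          = rbinom(n, 1, 0.6),
  Diabetes_binary = rbinom(n, 1, plogis(-3 + 0.6 * GenHlth))
)

# Hypothetical helper: subset to one sex, drop the Sex column, fit a GLM
fit_by_sex <- function(data, sex_value) {
  subset_df <- data |>
    filter(Sex == sex_value) |>
    select(-Sex)
  glm(Diabetes_binary ~ ., data = subset_df, family = binomial)
}

model_women <- fit_by_sex(toy, 0)
model_men   <- fit_by_sex(toy, 1)

# AIC of each model
aic_table <- c(women = AIC(model_women), men = AIC(model_men))
aic_table
```

`broom::tidy(model_men)` would return the coefficients as a tidy data frame if the broom package is available.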
Analysis part 1 results (same as part above)
Analysis part 2 results (same as part above)
Discussion and key takeaways